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Abstract 



Although the standard formulations of prediction problems involve fully-observed and 
noiseless data drawn in an i.i.d. manner, many applications involve noisy and/or missing 
data, possibly involving dependence, as well. We study these issues in the context of high- 
dimensional sparse linear regression, and propose novel estimators for the cases of noisy, 
missing, and/or dependent data. Many standard approaches to noisy or missing data, 
such as those using the EM algorithm, lead to optimization problems that are inherently 
non-convex, and it is difficult to establish theoretical guarantees on practical algorithms. 
While our approach also involves optimizing non-convex programs, we are able to both 
analyze the statistical error associated with any global optimum, and more surprisingly, 
to prove that a simple algorithm based on projected gradient descent will converge in 
polynomial time to a small neighborhood of the set of all global minimizers. On the 
statistical side, we provide non-asymptotic bounds that hold with high probability for 
the cases of noisy, missing, and/or dependent data. On the computational side, we prove 
that under the same types of conditions required for statistical consistency, the projected 
gradient descent algorithm is guaranteed to converge at a geometric rate to a near-global 
minimizer. We illustrate these theoretical predictions with simulations, showing close 
agreement with the predicted scalings. 

Keywords: High-dimensional statistics; missing data; non-convexity; regularization; sparse 
linear regression; M-estimation. 

1 Introduction 

In standard formulations of prediction problems, it is assumed that the covariates are fully- 
observed and sampled independently from some underlying distribution. However, these 
assumptions are not realistic for many applications, in which covariates may be observed only 
partially, observed subject to corruption, or exhibit some type of dependency. Consider the 
problem of modeling the voting behavior of politicians: in this setting, votes may be missing 
due to abstentions, and temporally dependent due to collusion or "tit-for-tat" behavior. Sim- 
ilarly, surveys often suffer from the missing data problem, since users fail to respond to all 
questions. Sensor network data also tends to be both noisy due to measurement error, and 
partially missing due to failures or drop-outs of sensors. 

There are a variety of methods for dealing with noisy and/or missing data, including 
various heuristic methods, as well as likelihood-based methods involving the expectation- 
maximization (EM) algorithm (e.g., see the book [ ] and references therein). A challenge in 
this context is the possible non-convexity of associated optimization problems. For instance. 
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in applications of EM, problems in which the negative likelihood is a convex function often 
become non-convex with missing or noisy data. Consequently, although the EM algorithm 
will converge to a local minimum, it is difficult to guarantee that the local optimum is close 
to a global minimum. 

In this paper, we study these issues in the context of high-dimensional sparse linear 
regression — in particular, in the case when the predictors or covariates are noisy, missing, 
and/or dependent. Our main contribution is to develop and study simple methods for han- 
dling these issues, and to prove theoretical results about both the associated statistical error 
and the optimization error. Like EM-based approaches, our estimators are based on solving 
optimization problems that may be non-convex; however, despite this non-convexity, we are 
still able to prove that a simple form of projected gradient descent will produce an output that 
is "sufficiently close" — as small as the statistical error — to any global optimum. As a second 
result, we bound the statistical error, showing that it has the same scaling as the minimax 
rates for the classical cases of perfectly observed and independently sampled covariates. In 
this way, we obtain estimators for noisy, missing, and/or dependent data that have the same 
scaling behavior as the usual fully-observed and independent case. The resulting estimators 
allow us to solve the problem of high-dimensional Gaussian graphical model selection with 
missing data. 

There is a large body of work on the problem of corrupted covariates or error-in- variables 
for regression problems (e.g., see the papers and books [7, 3, 8, 25], as well as references 
therein). Much of the earlier theoretical work is classical in nature, meaning that it requires 
that the sample size n diverges with the dimension p fixed. Most relevant to this paper is more 
recent work that has examined issues of corrupted and/or missing data in the context of high- 
dimensional sparse linear models, allowing for n <^ p. Stadler and Blihlmann [ ] developed an 
EM-based method for sparse inverse covariance matrix estimation in the missing data regime, 
and used this result to derive an algorithm for sparse linear regression with missing data. As 
mentioned above, however, it is difficult to guarantee that EM will converge to a point close 
to a global optimum of the likelihood, in contrast to the methods studied here. Rosenbaum 
and Tsybakov [ ] studied the sparse linear model when the covariates are corrupted by noise, 
and proposed a modified form of the Dantzig selector (see the discussion following our main 
results for a detailed comparison to this past work, and also to concurrent work [ s] by the 
same authors). For the particular case of multiplicative noise, the type of estimator that we 
consider here has been studied in past work [Jb]; however, this theoretical analysis is of the 
classical type, holding only for p, in contrast to the high-dimensional models that are of 
interest here. 

The remainder of this paper is organized as follows. We begin in Section 2 with background 
and a precise description of the problem. We then introduce the class of estimators we will 
consider and the form of the projected gradient descent algorithm. Section 3 is devoted 
to a description of our main results, including a pair of general theorems on the statistical 
and optimization error, and then a series of corollaries applying our results to the cases of 
noisy, missing, and dependent data. In Section 4, we demonstrate simulations to confirm that 
our methods work in practice, and verify the theoretically-predicted scaling laws. Section 5 
contains proofs of some of the main results, with the remaining proofs contained in the 
supplementary Appendix. 

Notation. For a matrix M, we write ||M||max := maxjj- \mij\ to be the elementwise ^oo- 
norm of Af. Furthermore, |||-/Vf|||i denotes the induced £i-operator norm (maximum absolute 
column sum) of M, and |||M|||op is the spectral norm of M. We write k{M) := ^""""(^\ , the 
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condition number of M. For matrices Mi, M2, we write M1QM2 to denote the componentwise 
Hadamard product, and write Mi©M2 to denote componentwise division. For functions /(n) 

and g{n), we write f{n) ^ g{n) to mean that /(n) < cg{n) for a universal constant c € (0, 00), 
and similarly, f{n) ^ g{n) when f{n) > c' g{n) for some universal constant d G (0, 00). Finally, 
we write /(n) x ^(n) when /(n) ;:j 5((n) and /(n) ^ (7(n) hold simultaneously. 

2 Background and problem setup 

In this section, we provide background and a precise description of the problem, and then 
motivate the class of estimators analyzed in this paper. We then discuss a simple class of 
projected gradient descent algorithms that can be used to obtain an estimator. 

2.1 Observation model and high-dimensional framework 

Suppose we observe a response variable G M linked to a covariate vector Xj G W via the 
linear model 

Ui = {xi, P*) + ei, ioT 1 = 1,2,..., n. (1) 

Here, the regression vector /3* G is unknown, and G M is observation noise, independent 
of Xj. Rather than directly observing each Xj G W, we observe a vector Zi G W linked to Xi 
via some conditional distribution, i.e., 

Zi^Qi- I Xi), for z = l,2,...,n. (2) 

This setup applies to various disturbances to the covariates, including: 

(a) Covariates with additive noise: We observe Zi = xi + Wi, where Wi G is a random 
vector independent of Xi, say zero-mean with known covariance matrix Ti^,. 

(b) Missing data: For some fraction p G [0, 1), we observe a random vector Zj G MP such 
that for each component j, we independently observe zij = Xij with probability 1 — p, 
and Zij = * with probability p. We can also consider the case when the entries in the 
j^^ column have a different probability pj of being missing. 

(c) Covariates with multiplicative noise: Generalizing the missing data problem, suppose 
we observe Zi = Xi Q Ui, where Ui G MP is again a random vector independent of Xi, 
and is the Hadamard product. The problem of missing data is a special case of 
multiplicative noise, where all ■Ujj's are independent and Uij ^ Bernoulli(l — pj). 

Our first set of results is deterministic, depending on specific instantiations of the observations 
{(z/i) However, we are also interested in results that hold with high probability when 
the Xi's and Zj's are drawn at random. We consider both the case when the x^'s are drawn i.i.d. 
from a fixed distribution; and the case of dependent covariates, when the generated 
according to a stationary vector autoregressive (VAR) process. 

We work within a high-dimensional framework that allows the number of predictors p to 
grow and possibly exceed the sample size n. Of course, consistent estimation when n <C p is 
impossible unless the model is endowed with additional structure — for instance, sparsity in 
the parameter vector /3* . Consequently, wc study the class of models where (3* has at most k 
non-zero parameters, where k is also allowed to increase to infinity with p and n. 
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2.2 A/-estimators for noisy and missing covariates 

In order to motivate the class of estimators we will consider, let us begin by examining a 
simple deterministic problem. Let S^; ;^ be the covariance matrix of the covariates, and 
consider the ^i-constrained quadratic program 

^G arg min {J/3^S,/3- (S,/3*, (3) 

As long as the constraint radius R is at least ||/3* ||i, the unique solution to this convex program 
is /3 = /3*. Of course, this program is an idealization, since in practice we may not know the 
covariance matrix S^^, and we certainly do not know Ti^P* — after all, f3* is the quantity we 
are trying to estimate! 

Nonetheless, this idealization still provides useful intuition, as it suggests various estima- 
tors based on the plug-in principle. Given a set of samples, it is natural to form estimates of 
the quantities 'Sx and which we denote by F G W^'f and 7 G W^, respectively, and to 

consider the modified program 

^Garg min { Vf/3 - (7, /3)}, (4) 
||/3j|i</? / 



or alternatively, the regularized version 
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P G arg min {-f^^Vf] - (7, /3) + A„||/3||i}, (5) 



2 

where An > is a user-defined regularization parameter. Note that the two problems are 
equivalent by Lagrangian duality when the objectives are convex, but not in the case of a 
non-convex objective. The Lasso [22, 4] is a special case of these programs, obtained by 
setting 

ftas := -X'^X and := -X'^y, (6) 

n n 

where we have introduced the shorthand y = (yi, . . . , yn)^ £ I^"; and X G M^^^, with xf as 
its i^^ row. A simple calculation shows that (rLas,7Las) are unbiased estimators of the pair 
(Sj:, TixP*)- This unbiasedness and additional concentration inequalities (to be described in 
the sequel) underlie the well-known analysis of the Lasso in the high-dimensional regime. 

In this paper, we focus on more general instantiations of the programs (4) and (5), involving 
different choices of the pair (r,7) that are adapted to the cases of noisy and/or missing data. 
Note that the matrix LLas is positive semidefinite, so the Lasso program is convex. In sharp 
contrast, for the case of noisy or missing data, the most natural choice of the matrix F is not 
positive semidefinite, hence the quadratic losses appearing in the problems (4) and (5) are 
non-convex. Furthermore, when F has negative eigenvalues, the objective in equation (5) is 
unbounded from below. Hence, we make use of the following regularized estimator: 

^Garg min { Vf /3 - (7, /3) + An||/3||i}, (7) 
\\l3\\i<boVk 2 

for a suitable constant 60 • 

In the presence of non-convexity, it is generally impossible to provide a polynomial-time 
algorithm that converges to a (near) global optimum, due to the presence of local minima. 
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Remarkably, we are able to prove that this issue is not significant in our setting, and a simple 
projected gradient descent algorithm applied to the programs (4) or (7) converges with high 
probability to a vector extremely close to any global optimum. 

Let us illustrate these ideas with some examples. Recall that (r,7) serve as unbiased estima- 
tors for 

Example 1 (Additive noise). Suppose we observe Z = X + W, where is a random 
matrix independent of X, with rows Wi drawn i.i.d. from a zero-mean distribution with known 
covariance T,w We consider the pair 

f add := -Z^Z - and 7add := -Z'^U- (8) 
n n 

Note that when = (corresponding to the noiseless case), the estimators reduce to the 
standard Lasso. However, when S^, ^ 0, the matrix F^dd is not positive semidefinite in the 
high- dimensional regime [n <^ p). Indeed, since the matrix ^Z'^Z has rank at most n, the 
subtracted matrix may cause F^dd to have a large number of negative eigenvalues. For 
instance, if = cr^/ for cr^ > 0, then F^dd has p — n eigenvalues equal to —a'^. 

Example 2 (Missing data). We now consider the case where the entries of X are missing at 
random. Let us first describe an estimator for the special case where each entry is missing at 
random, independently with some constant probability p S [0, 1). (In Example 3 to follow, 
we will describe the extension to general missing probabilities.) Consequently, we observe the 
matrix Z G M"^^ with entries 




Xij with probability 1 — p, 
otherwise. 



Given the observed matrix Z G M"^^, we use 

^ 2^ 2i Z'^ Z 1 ~ 

r„,i, := p diag ( ) and 7„,i, := -Z^y, (9) 

n n n 

where Zjj = Zij /{I— p). It is easy to see that the pair (Fnjis,7^is) reduces to the pair (FLas,7Las) 
for the standard Lasso when p = 0, corresponding to no missing data. In the more interesting 
case when p £ (0, 1), the matrix in equation (9) has rank at most n, so the subtracted 
diagonal matrix may cause the matrix F^i^ to have a large number of negative eigenvalues 
when n <^ p. As a consequence, the matrix T^i^ is not (in general) positive semidefinite, so 
the associated quadratic function is not convex. 

Example 3 (Multiplicative noise). As a generalization of the previous example, we now 
consider the case of multiplicative noise. In particular, suppose we observe the quantity 
Z = X Q U , where C/ is a matrix of nonnegative noise variables. In many applications, it 
is natural to assume that the rows Ui of U are drawn in an i.i.d. manner, say from some 
distribution in which both the vector E[ui] and the matrix E[iiiM^] have strictly positive 
entries. This general family of multiplicative noise models arises in various applications; we 
refer the reader to the papers [7, 3, 8, 25] for more discussion and examples. A natural choice 
of the pair (F,7) is given by the quantities 

f^,, := -Z^Z©E(ninf) and f := -Z^y © E(ni), (10) 
n n 
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where © denotes elementwise division. A small calculation shows that these are unbiased 
estimators of Ti^ and 'S^fS*, respectively. The estimators (10) have been studied in past 
work [25], but only under classical scaling {n ^ p). 

As a special case of the estimators (10), suppose the entries Uij of U are independent 
Bernoulli (1 — pj) random variables. Then the observed matrix Z = X Q U corresponds to a 
missing-data matrix, where each element of the j^^ column has probability pj of being missing. 
In this case, the estimators (10) become 

- Z'^Z 1 

r„,i, = © M and 7„,i3 = -Z^y © (1 - p), (11) 

n n 

where M := ¥,{uiuj) satisfies 

M, = /(I ft) '"^i 

p is the parameter vector containing the /o/s, and 1 is the vector of all I's. In this way, we 
obtain a generalization of the estimator discussed in Example 2. 



2.3 Restricted eigenvalue conditions 

Given an estimate f3, there are various ways to assess its closeness to (3*. In this paper, we 
focus on the ^2-norm — /?*||2, as well as the closely related £i-norm ||/3 — When 
the covariate matrix X is fully observed (so that the Lasso can be applied), it is now well 
understood that a sufficient condition for £2-recovery is that the matrix Fl^s = ^ satisfy 
a certain type of restricted eigenvalue (RE) condition (e.g., [2, 23]). In this paper, we make 
use of the following condition. 

Definition 1 (Lower- RE condition). The matrix F satisfies a lower restricted eigenvalue 
condition with curvature ai > and tolerance r(n,p) > if 

e'^f6>ai\\e\\l - T{n,p)\\e\\l io^alie^W. (12) 

It can be shown that when the Lasso matrix T-^^^ = ^X^X satisfies this RE condition (12), 
the Lasso estimate has low £2-error for any vector (3* supported on any subset of size at 
most k < . In particular, bound (12) implies a sparse RE condition for all k of this 

magnitude, and conversely. Lemma 11 in the Appendix shows that a sparse RE condition 
implies bound (12). In this paper, we work with condition (12), since it is especially convenient 
for analyzing optimization algorithms. 

In the standard setting (with uncorrupted and fully observed design matrices), it is known 
that for many choices of the design matrix X (with rows having covariance S), the Lasso 
matrix FLas will satisfy such an RE condition with high probability (e.g., [16, 20]) with 
Oil = ^Amin(S) and T{n,p) x A significant portion of the analysis in this paper is 

devoted to proving that different choices of F, such as the matrices F^dd and T^^^ defined ear- 
lier, also satisfy condition (12) with high probability. This fact is by no means obvious, since 
as previously discussed, the matrices F^dd and F^^is generally have large numbers of negative 
eigenvalues. 

Finally, although such upper bounds are not necessary for statistical consistency, our 
algorithmic results make use of the analogous upper restricted eigenvalue condition, formalized 
in the following: 
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Definition 2 (Upper- RE condition). The matrix T satisfies an upper restricted eigenvalue 
condition with smoothness 02 > and tolerance T{n,p) > if 

e^fe <a2\\e\\l + T{n,p)\\e\\l for an 6* grp. (13) 

In recent work on high-dimensional projected gradient descent, Agarwal et al. [ ] make use of 
a more general form of the lower and upper bounds (12) and (13), applicable to non-quadratic 
losses as well, which are referred to as the restricted strong convexity (RSC) and restricted 
smoothness (RSM) conditions, respectively. For various class of random design matrices, it 
can be shown that the Lasso matrix Ti^^^ satisfies the upper bound (13) with 02 = 2Aniax(Sx) 
and T{n,p) x ^^^'1 see Raskutti et al. [ ] for the Gaussian case and Rudelson and Zhou [20] 
for the sub-Gaussian setting. We will establish similar scaling for our choices of T. 

2.4 Gradient descent algorithms 

In addition to proving results about the global minima of the (possibly non-convex) pro- 
grams (4) and (5), we are also interested in polynomial-time procedures for approximating 
such optima. In this paper, we analyze some simple algorithms for solving either the con- 
strained program (4) or the Lagrangian version (7). Note that the gradient of the quadratic 
loss function takes the form V£(/3) = Tf3 — 7. In application to the constrained version, the 
method of projected gradient descent generates a sequence of iterates {(3\ t = 0, 1, 2, . . .} by 
the recursion 

/3*+i = arg inin {£(/3*) + (V£(/3*), /3 - + - /3'g}, (14) 

where r/ > is a stepsize parameter. Equivalently, this update can be written as {3^^^ = 
n(/3* — ^V£(/3*)), where 11 denotes the ^2-pi'ojection onto the ^i-ball of radius R. This 
projection can be computed rapidly in 0{p) time using a procedure due to Duchi et al. [5]. 
For the Lagrangian update, we use a slight variant of the projected gradient update (14), 
namely 

/3*+i = arg min {£(/3*) + (V£(/3*), /? - /3*) + ^||/3 - + An||/3||i}, (15) 

with the only difference being the inclusion of the regularization term. This update can also 
performed efficiently by performing two projections onto the ^i-ball (see the paper [ ] for 
details). 

When the objective function is convex (equivalently, F is positive semidefinite) , the iter- 
ates (14) or (15) are guaranteed to converge to a global minimum of the objective functions (4) 
and (7), respectively. In our setting, the matrix F need not be positive semidefinite, so the 
best generic guarantee is that the iterates converge to a local optimum. However, our analysis 
shows that for the family of programs (4) or (7), under a reasonable set of conditions satisfied 
by various statistical models, the iterates actually converge to a point extremely close to any 
global optimum in both £i-norm and ^2-norm; see Theorem 2 to follow for a more detailed 
statement. 

3 Main results and consequences 

We now state our main results and discuss their consequences for noisy, missing, and depen- 
dent data. 
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3.1 General results 



We provide theoretical guarantees for both the constrained estimator (4) and the Lagrangian 
version (7). Note that we obtain different optimization problems as we vary the choice of 
the pair (r,7) G MP^^ x MP. We begin by stating a pair of general results, applicable to any 
pair that satisfies certain conditions. Our first result (Theorem 1) provides bounds on the 
statistical error, namely the quantity ||/3 — /3*||2, as well as the corresponding £i-error, where 
(3 is any global optimum of the programs (4) or (7). Since the problem may be non-convex 
in general, it is not immediately obvious that one can obtain a provably good approximation 
to any global optimum without resorting to costly search methods. In order to assuage this 
concern, our second result (Theorem 2) provides rigorous bounds on the optimization error, 
namely the differences ||/3* — /3||2 and — /3||i incurred by the iterate /3* after running t 
rounds of the projected gradient descent updates (14) or (15). 



3.1.1 Statistical error 

In controlling the statistical error, we assume that the matrix T satisfies a lower-RE condition 
with curvature ai and tolerance T{n,p), as previously defined (12). Recall that T and 7 serve 
as surrogates to the deterministic quantities E-^ G RP^p and TixP* G M^, respectively. Our 
results also involve a measure of deviation in these surrogates. In particular, we assume that 
there is some function ip(Q,a^), depending on the two sources of noise in our problem: the 
standard deviation of the observation noise vector e from equation (1), and the conditional 
distribution Q from equation (2) that links the covariates Xi to the observed versions z,. With 
this notation, we consider the deviation condition 



||7-rriloo<v'(Q,a,)-y^. (16) 

To aid intuition, note that inequality (16) holds whenever the following two deviation condi- 
tions are satisfied: 



II7-S./3II00 < V5(Q,fTe)V— and ||(r-S,)/3*|U <(^(Q,^e)V— . (17) 

V n \ n 

The pair of inequalities (17) clearly measures the deviation of the estimators (r,7) from 
their population versions, and they are sometimes easier to verify theoretically. However, 
inequality (16) may be used directly to derive tighter bounds (e.g., in the additive noise case). 
Indeed, the bounds established via inequalities (17) is not sharp in the limit of low noise on 
the covariates, due to the second inequality. In the proofs of our corollaries to follow, we will 
verify the deviation conditions for various forms of noisy, missing, and dependent data, with 
the quantity (/?(Q, 0"^) changing depending on the model. We have the following result, which 

applies to any global optimum f3 of the regularized version (7) with An > 4(/?(Q, '^fi)\f^^^'- 

Theorem 1 (Statistical error). Suppose the surrogates (r,7) satisfy the deviation hound (16), 
and the matrix T satisfies the lower-RE condition (12) with parameters (ai,T) such that 

Vfcr(n,^,)<min{^, f^y (18) 
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Then for any vector (3* with sparsity at most k, there is a universal positive constant cq such 
that any global optimum (3 of the Lagrangian program (7) with any bo > ||/3*||2 satisfies the 
bounds 



ll^-rib < — max{vp(Q,a,) Ji^, A„}, and (19a) 
ai V n 

< ^ max{99(Q,a,)y^, A„}. (19b) 
The same bounds (without A„) also apply to the constrained program (4) with radius choice 

R=\W*\\i- 



Remarks To be clear, all the claims of Theorem 1 are deterministic. Probabilistic conditions 
will enter when we analyze specific statistical models and certify that the RE condition (18) 
and deviation conditions are satisfied by a random pair (r,7) with high probability. We 
note that for the standard Lasso choice (Flj^s, 7l;,J of this matrix-vector pair, bounds of the 
form (19) for sub-Gaussian noise are well known from past work (e.g., [2, 27, 13, I I]). The nov- 
elty of Theorem 1 is in allowing for general pairs of such surrogates, which — as shown by the 
examples discussed earlier — can lead to non-convexity in the underlying M-estimator. More- 
over, some interesting differences arise due to the term (/^(Q, de), which changes depending on 
the nature of the model (missing, noisy, and/or dependent). As will be clarified in the sequel, 
proving that the conditions of Theorem 1 are satisfied with high probability for noisy/missing 
data requires some non-trivial analysis, involving both concentration inequalities and random 
matrix theory. 

Note that in the presence of non-convexity, it is possible in principle for the optimization 
problems (4) and (7) to have many global optima that are separated by large distances. 
Interestingly, Theorem 1 guarantees that this unpleasant feature does not arise under the 
stated conditions: given any two global optima (3 and f3 of the program (4), Theorem 1 
combined with the triangle inequality guarantees that 

\\d-M2<0-n + 0-n <2co^-^J^ 

ai \ n 

and similarly for the program (7). Consequently, under any scaling such that = o(l), 

the set of all global optima must lie within an ^2-ball whose radius shrinks to zero. 

In addition, it is worth observing that Theorem 1 makes a specific prediction for the scaling 
behavior of the ^2-error ||/3 — /3*||2. In order to study this scaling prediction, we performed 
simulations under the additive noise model described in Example 1, using the parameter 
setting Tlx = I and = o"^/ with = 0.2. Panel (a) of Figure 1 provides plots^ of the 
error ||/3— /3* ||2 versus the sample size n, for problem dimensions p S {128, 256, 512}. Note that 
for all three choices of dimensions, the error decreases to zero as the sample size n increases, 
showing consistency of the method. The curves also shift to the right as the dimension p 
increases, refiecting the natural intuition that larger problems are harder in a certain sense. 
Theorem 1 makes a specific prediction about this scaling behavior: in particular, if we plot 
the ^2-error versus the rescaled sample size n/{klogp), the curves should roughly align for 

^Corollary 1, to be stated shortly, guarantees that the conditions of Theorem 1 are satisfied with high 
probabihty for the additive noise modeL In addition, Theorem 2 to follow provides an efficient method of 
obtaining an accurate approximation of the global optimum. 
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Additive noise 




p=128 
p=256 
p=512 



Additive noise 




Figure 1. Plots of the error — P*\\2 after running projected gradient descent on the non- 
convex objective, with sparsity k « ^fp. Plot (a) is an error plot for i.i.d. data with additive 
noise, and plot (b) shows £2-6rror versus the rescaled sample size ^ ^"^ ^ . As predicted by 
Theorem 1, the curves align for different values of p in the rescaled plot. 

different values of p. Panel (b) shows the same data re-plotted on these rescaled axes, thus 
verifying the predicted "stacking behavior." 

Finally, as noted by a reviewer, the constraint R = in the program (4) is rather 

restrictive, since /3* is unknown. Theorem 1 merely establishes a heuristic for the scaling 
expected for this optimal radius. In this regard, the Lagrangian estimator (7) is more ap- 
pealing, since it only requires choosing 6o to be larger than ||/3*||2, and the conditions on the 
regularizer Xn are the standard ones from past work on the Lasso. 

3.1.2 Optimization error 

Although Theorem 1 provides guarantees that hold uniformly for any global minimizer, it does 
not provide guidance on how to approximate such a global minimizer using a polynomial- 
time algorithm. Indeed, for non-convex programs in general, gradient-type methods may 
become trapped in local minima, and it is impossible to guarantee that all such local minima 
are close to a global optimum. Nonetheless, we are able to show that for the family of 
programs (4), under reasonable conditions on T satisfied in various settings, simple gradient 
methods will converge geometrically fast to a very good approximation of any global optimum. 
The following theorem supposes that we apply the projected gradient updates (14) to the 
constrained program (4), or the composite updates (15) to the Lagrangian program (7), with 
stepsize rj = 2a2- In both cases, we assume that n ^ k\ogp, as is required for statistical 
consistency in Theorem 1. 

Theorem 2 (Optimization error). Under the conditions of Theorem 1: 

(a) For any global optimum (3 of the constrained program (4), there are universal positive 
constants (ci,C2) and a contraction coefficient 7 G (0,1), independent of {n,p,k), such 
that the gradient descent iterates (14) satisfy the bounds 

11/3* - M < 7*11/3° - M + cii^ll^- /3*||? + C20-f3*\\l (20) 

n 

11/3* - ^lli < 2Vk 11/3* - M2 + 11^-/3*112 + 2 11^- /3*||i, (21) 
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for all t > 0. 

(b) Letting (f) denote the objective function of Lagrangian program (7) with global optimum 
(3, and applying composite gradient updates (15), there are universal positive constants 
(ci,C2) and a contraction coefficient^ G (0)1); independent of {n,p,k), such that 

11/3* - <ci 11^-/3* 11^ for all iterates t>T, (22) 

^ V ' 

where T := C2 log ^^^^"^^^'^^^^^ / log(l/7) . 

Remarks As with Theorem 1, these claims are deterministic in nature. Probabilistic condi- 
tions will enter into the corollaries, which involve proving that the surrogate matrices F used 
for noisy, missing, and/or dependent data satisfy the lower- and upper- RE conditions with 
high probability. The proof of Theorem 2 itself is based on an extension of a result due to 
Agarwal et al. [ .] on the convergence of projected gradient descent and composite gradient 
descent in high dimensions. Their result as originally stated imposed convexity of the loss 
function, but the proof can be modified so as to apply to the non-convex loss functions of 
interest here. As noted following Theorem 1, all global minimizers of the non-convex pro- 
gram (4) lie within a small ball. In addition. Theorem 2 guarantees that the local minimizers 
also lie within a ball of the same magnitude. Note that in order to show that Theorem 2 can 
be applied to the specific statistical models of interest in this paper, a considerable amount of 
technical analysis remains in order to establish that its conditions hold with high probability. 

In order to understand the significance of the bounds (20) and (22), note that they provide 
upper bounds for the £2-distance between the iterate /3* at time t, which is easily computed in 
polynomial-time, and any global optimum /3 of the program (4) or (7), which may be difficult 
to compute. Focusing on bound (20), since 7 G (0, 1), the first term in the bound vanishes as t 
increases. The remaining terms involve the statistical errors ||/3 — /3* ||q, for q = 1,2, which are 
controlled in Theorem 1. It can be verified that the two terms involving the statistical error on 
the right-hand side are bounded as 0( ^^°^^ ), so Theorem 2 guarantees that projected gradient 
descent produce an output that is essentially as good — in terms of statistical error — as any 
global optimum of the program (4). Bound (22) provides a similar guarantee for composite 
gradient descent applied to the Lagrangian version. 

Experimentally, we have found that the predictions of Theorem 2 are borne out in simula- 
tions. Figure 2 shows the results of applying the projected gradient descent method to solve 
the optimization problem (4) in the case of additive noise (panel (a)), and missing data (panel 
(b)). In each case, we generated a random problem instance, and then applied the projected 
gradient descent method to compute an estimate /?. We then reapplied the projected gradient 
method to the same problem instance 10 times, each time with a random starting point, and 
measured the error ||/3* — /3||2 between the iterates and the first estimate (optimization error), 
and the error — /3*||2 between the iterates and the truth (statistical error). Within each 
panel, the blue traces show the optimization error over 10 trials, and the red traces show the 
statistical error. On the logarithmic scale given, a geometric rate of convergence corresponds 
to a straight line. As predicted by Theorem 2, regardless of the starting point, the iterates 
exhibit geometric convergence to the same fixed point. The statistical error contracts 
geometrically up to a certain point, then flattens out. 

^To be precise, Theorem 2 states that the iterates will converge geometrically to a small neighborhood of 
all the global optima. 
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Figure 2. Plots of the optimization error log(|l/3* — /3||2) and statistical error log(|l/3* — P*\\2) 
versus iteration number t, generated by running projected gradient descent on the non-convex 
objective. Each plot shows the solution path for the same problem instance, using 10 different 
starting points. As predicted by Theorem 2, the optimization error decreases geometrically. 



3.2 Some consequences 

As discussed previously, both Theorems 1 and 2 are deterministic results. Applying them to 
specific statistical models requires some additional work in order to establish that the stated 
conditions are met. We now turn to the statements of some consequences of these theorems 
for different cases of noisy, missing, and dependent data. In all the corollaries below, the 
claims hold with probability greater than 1 — ci exp(— C2 logp), where (ci,C2) are universal 
positive constants, independent of all other problem parameters. Note that in all corollaries, 
the triplet {n,p,k) is assumed to satisfy scaling of the form n ^ klogp, as is necessary for 
^2-consistent estimation of /c-sparse vectors in p dimensions. 

Definition 3. We say that a random matrix X G R"^^ is sub-Gaussian with parameters 
if: 

(a) each row xj G is sampled independently from a zero-mean distribution with covari- 
ance S, and 

(b) for any unit vector u ^ MP, the random variable u^Xi is sub-Gaussian with parameter 
at most a. 

For instance, if we form a random matrix by drawing each row independently from the 
distribution A^(0, S), then the resulting matrix X G M"^^' is a sub-Gaussian matrix with pa- 
rameters (s. III s III op). 



3.2.1 Bounds for additive noise: i.i.d. case 

We begin with the case of i.i.d. samples with additive noise, as described in Example 1. 

Corollary 1. Suppose that we observe Z = X + W , where the random matrices X,W G M'^^p 
are sub-Gaussian with parameters (Sj,.,^^), and let e be an i.i.d. sub-Gaussian vector with 
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parameter o"^. Let a1 = o"^. + o"^. Then under the scaling n ^3 max | p-^^y^^, ij/c logp, for 

^ min^ 

the M -estimator based on the surrogates (Xadd^^add), the results of Theorems 1 and 2 hold 
with parameters ai = \\min{^x) and (p{Q_,ae) = coaz{(7w + ae)\\f3*\\2, with probability at least 
1 - ci exp(-C2 logp). 

Remarks 

(a) Consequently, the ^2-error of any optimal solution (3 satisfies the bound 



I2 ^ 



crz{(^w + CTe) r,*.. klogp 



with high probability. The prefactor in this bound has a natural interpretation as an inverse 
signal-to-noise ratio; for instance, when X and W are zero-mean Gaussian matrices with 
row covariances S^, = a^I and = cr^/, respectively, we have Xmmi^x) = crl^ so 



XI 

2 ' 




Xmin{^x) 

This quantity grows with the ratios o'w/o'x and a^/o'x, which measure the SNR of the ob- 
served covariates and predictors, respectively. Note that when = 0, corresponding to 
the case of uncorrupted covariates, the bound on ^2-error agrees with known results. See 
Section 4 for simulations and further discussions of the consequences of Corollary 1. 

(b) We may also compare the results in (a) with bounds from past work on high-dimensional 
sparse regression with noisy covariates In this work, Rosenbaum and Tsybakov derive 
similar concentration bounds on sub-Gaussian matrices. The tolerance parameters are all 



C(y ^^^), with prefactors depending on the sub-Gaussian parameters of the matrices. In 
particular, in their notation. 



2^ /log^J, 



n 



leading to the bound (cf. Theorem 2 of Rosenbaum and Tsybakov [ ]) 



^m'm{^x) ^minV^xj V Tl 

Extensions to unknown noise covariance: Situations may arise where the noise co- 
variance Tiw is unknown, and must be estimated from the data. One simple method is to 
assume that is estimated from independent observations of the noise. In this case, sup- 
pose we independently observe a matrix Wq G M"^*' with n i.i.d. vectors of noise. Then we 
use = ^WqWo as our estimate of T,^. A more sophisticated variant of this method (cf. 
Chapter 4 of Carroll et al. [ ]) assumes that we observe ki replicate measurements Zn, . . . , Zj^ 
for each Xj and form the estimator 

Based on the estimator Tiyj, we form the pair (r,7) such that 7 = ^Z^y and F = — S^. 
In the proofs of Section 5, we will analyze the case where S^^, = ^^WqWq and show that the 
result of Corollary 1 still holds when must be estimated from the data. Note that the 
estimator in equation (23) will also yield the same result, but the analysis is more complicated. 
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3.2.2 Bounds for missing data: i.i.d. case 

Next, we turn to the case of i.i.d. samples with missing data, as discussed in Example 3. For 
a missing data parameter vector p, we define pmax := maxj pj, and assume Pmax < 1- 

Corollary 2. Let X £ M"^^ be sub-Gaussian with parameters (S^,(T^), and Z the missing 
data matrix with parameter p. Let e be an i.i.d. sub-Gaussian vector with parameter a^. If 
n >~ maxfv^ — — ^4 >2 m l)fclogp, then Theorems 1 and 2 hold with probability at least 
1 - ci exp(-C2logp) for qi = \\rain{^x) and (/?(Q,ctJ = CO i^p^^^ (tTe + l-pL, )ll/^*l|2- 

Remarks Suppose X is a Gaussian random matrix and pj = p for all j. In this case, the 
ratio T — "^^^ s = '^'"'"''.^fi? = KiTir) is the condition number of S^.. Then 

a ^ VAmin(S.)l-p^ II'' 

a quantity that depends on both the conditioning of S^., and the fraction p G [0, 1) of missing 
data. We will consider the results of Corollary 2 applied to this example in the simulations 
of Section 4. 



Extensions to unknown p: As in the additive noise case, we may wish to consider the 
case when the missing data parameters p are not observed and must be estimated from the 
data. For each j = 1, 2, . . . ,p, we estimate pj using pj, the empirical average of the number 
of observed entries per column. Let p G denote the resulting estimator of p. Naturally, 
we use the pair of estimators (F, 7) defined by 

f = ^^©M and 7= -Z^y©(l-p), (24) 
n n 

where 



(1 -Pj) ifi/j 
I - Pi if i = j. 



We will show in Section 5 that Corollary 2 holds when p is estimated by p. 



3.2.3 Bounds for dependent data 

Turning to the case of dependent data, we consider the setting where the rows of X are drawn 
from a stationary vector autoregressive (VAR) process according to 

Xi+i = Axi Vi, for i = 1,2, . . . ,n - 1, (25) 

where vi G is a zero- mean noise vector with covariance matrix S^,, and A G W^'P is a 
driving matrix with spectral norm |||^|||2 < 1- We assume the rows of X are drawn from a 
Gaussian distribution with covariance S^;, such that = ATix^ + S^,. Hence, the rows of 
X are identically distributed but not independent, with the choice ^ = giving rise to the 
i.i.d. scenario. Corollaries 3 and 4 correspond to the case of additive noise and missing data 
for a Gaussian VAR process. 
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Corollary 3. Suppose the rows of X are drawn according to a Gaussian VAR process with 
driving matrix A. Suppose the additive noise matrix W is i.i.d. with Gaussian rows, and 
let e he an i.i.d. sub-Gaussian vector with parameter a^. If max ( 2 , . , l^klogp, with 

mill V ^ ' 

('^ = |||5]«,|||op+ Yzj^jjp"; then Theorems 1 and 2 hold with probability at least 1—ci exp(— C2 logp) 
forai = ^Xmini^x) and f{Q,ae) = co(fT,C + C^) ||/3* lb- 
Corollary 4. Suppose the rows of X are drawn according to a Gaussian VAR process with 
driving matrix A, and Z is the observed matrix subject to missing data, with parameter p. Let e 
be an i.i.d. sub-Gaussian vector with parameter a^. If n'^ max {^^~^~t^i ^)klogp, with Q'"^ = 

mill ^ ^ ^ 

1 2|||Sa:|j[op tJig^ii Theorems 1 and 2 hold with probability at least 1 — ci exp(— C2 logp) 



-lll^lll 



op 



for ai = \\rain(^x) and ip{Q,ae) = Co(cTeC' + C'^ 
Remarks. Note that the scahng and the form of tp in Corollaries 2-4 are very similar, ex- 

2 

cept with different effective variances o"^ = j-^ — — — y^, C^, or C'^, depending on the type of 

\^ Pmax J 

corruption in the data. As we will see in Section 5, the proofs involve verifying the deviation 
conditions (17) using similar techniques. On the other hand, the proof of Corollary 1 proceeds 
via deviation condition (16), which produces a tighter bound. 

We may also extend the cases of dependent data to situations when S^, and p are unknown 
and must be estimated from the data. The proofs of these extensions are identical to the i.i.d 
case, so we will omit them. 



3.3 Application to graphical model inverse covariance estimation 

The problem of inverse covariance estimation for a Gaussian graphical model is also related 
to the Lasso. Meinshausen and Biihlmann [ ,] prescribed a way to recover the support of the 
precision matrix when each column of is /c-sparse, via linear regression and the Lasso. 
More recently. Yuan [26] proposed a method for estimating using the Dantzig selector, and 
obtained error bounds on |||0 — 0|||i when the columns of are bounded in ^l. Both of these 
results assume that X is fully-observed and has i.i.d. rows. 

Suppose we are given a matrix X G M"^p of samples from a multivariate Gaussian dis- 
tribution, where each row is distributed according to A^(0, E). We assume the rows of X are 
either i.i.d. or sampled from a Gaussian VAR process. Based on the modified Lasso of the 
previous section, we devise a method to estimate based on a corrupted observation matrix 
Z, when is sparse. Our method bears similarity to the method of Yuan [20] , but is valid 
in the case of corrupted data, and does not require an ii column bound. Let X^ denote the 
j^^ column of X, and let X~^ denote the matrix X with j'**^ column removed. By standard 
results on Gaussian graphical models, there exists a vector 9^ G M^"^ such that 

X^ = X-^e^ + e^, (26) 

where is a vector of i.i.d. Gaussians and _LL X~K Setting aj := — (Sjj — T,j-j9^)~^ , we 
can verify that = OjOK Our algorithm, described below, forms estimates 9^ and Oj for 

each j, then combines the estimates to obtain an estimate Qj-j = aj9K 

In the additive noise case, we observe the matrix Z = X + W . From the equations (26), 
we obtain Z^ = X~W^ -\- {e^ -\- W^). Note that 6^ = -\- is a vector of i.i.d. Gaussians, 
and since X _LL W, we have 5^ _LL X~^. Hence, our results on covariates with additive noise 
allow us to recover 9^ from Z. We can verify that this reduces to solving the program (4) 
or (7) with the pair (f (j), 7(j)) = (S_j _j, ^Z'^'^Z^), where S = ^Z'^Z - S^. 
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When Z is a missing-data version of X, we similarly estimate the vectors 6^ via equa- 
tion (26), using our results on the Lasso with missing covariates. Here, both covariates and 
responses are subject to missing data, but this makes no difference in our theoretical results. 
For each j, we use the pair 

(fo-),^o-)) = Iz-^^z^ © (1 - p-^-)(i - /,,)), 

where S = Z © M, and M is defined as in Example 3. 

To obtain the estimate 0, we therefore propose the following procedure, based on the 
estimators 7(-?))}^^^ and S. 

Algorithm 3.1. (1) Perform p linear regressions of the variables Z^ upon the remaining 
variables Z~^ , using the program (4) or (7) with the estimators (r(-'),7(-')), to obtain 
estimates 0^ of 9^ . 

(2) Estimate the scalars aj using the quantity aj := —{^jj — I^j,-j6^)^^ , based on the 
estimator S. Form Q with = ajO^ and Qjj = —Uj. 

(3) Set B = arg min |||0 — 0|||i, where 5"^ is the set of symmetric matrices. 

Note that the minimization in step (3) is a linear program, so is easily solved with standard 
methods. We have the following corollary about 0: 

Corollary 5. Suppose the columns of the matrix Q are k-sparse, and suppose the condition 
number k{@) is nonzero and finite. Suppose we have 



\T 



(i)_fO-)0i||^<^(Q,^j^l5^, Vj, (27) 



and suppose we have the following additional deviation condition on S.' 



||S-S|Uax < C(/j(Q,fTe)A/ — ■ (28) 

V n 

Finally, suppose the lower-RE condition holds uniformly over the matrices F^-'^ with the scal- 
ing (18). Then under the estimation procedure of Algorithm 3.1, there exists a universal 
constant cq such that 



lie _ em ^ cok2(£) . (^(Q,^e) ^ vp(Q,ae)^,^, /logp 



Amm(5^) Amin(5^) «! V n 

Remarks. Note that Corollary 5 is again a deterministic result, with parallel structure to 
Theorem 1. Furthermore, the deviation bounds (27) and (28) hold for all scenarios considered 
in Section 3.2 above, using Corollaries 1-4 for the first two inequalities, and a similar bounding 
technique for — SUmax! and the lower-RE condition holds over all matrices F^-'^ by the same 
technique used to establish the lower-RE condition for F. The uniformity of the lower-RE 
bound over all sub-matrices holds because 

< Aniin(5^) ^ Amin(5^-j,-j) ^ Amax(5^-j,-j) ^ Aniax(S) < OO. 

Hence, the error bound in Corollary 5 holds with probability at least 1 — ci exp(— C2 logp) 
when n ^ klogp, for the appropriate values of and ai. 
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4 Simulations 



In this section, we report some additional simulation results to confirm that the scalings 
predicted by our theory are sharp. In Figure 1 following Theorem 1, we showed that the error 
curves align when plotted against a suitably rescaled sample size, in the case of additive noise 
perturbations. Panel (a) of Figure 3 shows these same types of rescaled curves for the case of 
missing data, with sparsity k ~ yjp, covariate matrix Ti^ = /, and missing fraction p = 0.2, 
whereas panel (b) shows the rescaled plots for the vector autoregressive case with additive 
noise perturbations, using a driving matrix A with |||j4|||op = 0.2. Each point corresponds to an 
average over 100 trials. Once again, we see excellent agreement with the scaling law provided 
by Theorem 1. 




(a) 



(b) 



Figure 3. Plots of the error — /3*||2 after running projected gradient descent on the non- 
convex objective, with sparsity k sa y^. In all cases, we plotted the error versus the rescaled 
sample size ^, ^ . As predicted by Theorems 1 and 2, the curves align for different values of p 
when plotted in this rescaled manner, (a) Missing data case with i.i.d. covariates. (b) Vector 
autoregressive data with additive noise. Each point represents an average over 100 trials. 

We also ran simulations to verify the form of the function (/?(Q, cXe) appearing in Corollar- 
ies 1 and 2. In the additive noise setting for i.i.d. data, we set S^,. = / and e equal to i.i.d. 
Gaussian noise with = 0.5. For a fixed value of the parameters p = 256 and k ~ logp, 
we ran the projected gradient descent algorithm for different values of cTw G (0.1,0.3), such 
that = cr^/ and n 60(1 -|- cr^ ) ^/c log p, with ||/3*||2 = 1- According to the theory, 
^^{(Tu, + 0.5)v^l + f72, SO that 



/3 2 ;6 {(Tw + 0.5 VI + I, ^ 2^2M - H 9 



In order to verify this prediction, we plotted versus the rescaled error — /3*||2- 

As shown by panel (a) of Figure 4(a), the curve is roughly constant, as predicted by the 
theory. 

Similarly, in the missing data setting for i.i.d. data, we set Y^x = I and e equal to i.i.d. 
Gaussian noise with = 0.5. For a fixed value of the parameters p = 128 and k ~ logp, 
we ran simulations for different values of the missing data parameter p G (0,0.3), such that 
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Rescaled error plot, missing data 



0.05 0.1 



0.2 0.25 



(a) Additive noise scaling 



(b) Missing data scaling 



Figure 4. (a) Plot of the rescaled £2-error — /?* II2 versus the additive noise standard 

deviation <7yj for the i.i.d. model with additive noise, (b) Plot of the rescaled €2-error 
versus the missing fraction p for the i.i.d. model with missing data. Both curves are roughly 
constant, showing that our error bounds on — /3*||2 exhibit the proper scaling. Each point 
represents an average over 200 trials. 



n 



60 



:k\ogp. According to the theory, 



-p + (l-p)2 • 



specified scalings of {n,p, k), we should expect a bound of the form 



Consequently, with our 



I2 ^ 



a 



,ae) klogp 



n 



1 + 0.5(1 -p). 



The plot of p versus the rescaled error is shown in Figure 4(b). The curve is again 

roughly constant, agreeing with theoretical results. 

Finally, we studied the behavior of the inverse covariance matrix estimation algorithm on 
three types of Gaussian graphical models: 

(a) Chain- structured graphs. In this case, all nodes of the graph are arranged in a linear 
chain. Hence, each node (except the two end nodes) has degree k = 2. The diagonal 
entries of are set equal to 1, and all entries corresponding to links in the chain are set 
equal to 0.1. Then is rescaled so |||0|||op = 1. 

(b) Star-structured graphs. In this case, all nodes are connected to a central node, which 
has degree k ~ O.lp. All other nodes have degree 1. The diagonal entries of G are set 
equal to 1, and all entries corresponding to edges in the graph are set equal to 0.1. Then 



is rescaled so |||0|| 



1. 



(c) Erdds-Renyi graphs. This example comes from Rothman et al. [19]. For a sparsity 
parameter k ~ logp, we randomly generate the matrix by first generating the matrix 
B such that the diagonal entries are 0, and all other entries are independently equal to 
0.5 with probability k/p, and otherwise. Then 5 is chosen so that Q = B + 61 has 
condition number p. Finally, is rescaled so |||0|||op = 1- 

After generating the matrix X of n i.i.d. samples from the appropriate graphical model, 
with covariance matrix S^. = 0~^, we generated the corrupted matrix Z = X + W with 
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= (0.2)^/ in the additive noise case, or the missing data matrix Z with p = 0.2 in the 
missing data case. 

Panels (a) and (c) in Figure 5 show the rescaled ^2-error ;^|||0 — ©flop plotted against 
the sample size n for a chain-structured graph. In panels (b) and (d), we have ^2-error 
plotted against the rescaled sample size, n/ (klogp). Once again, we see good agreement with 
the theoretical predictions. We have obtained qualitatively similar results for the star and 
Erdos-Renyi graphs. 




(a) ^2 error plot for chain graph, additive noise 



Chain graph, missing d 
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(b) rescaled plot 



Chain graph, missing data 



200 300 400 500 




(c) £2 error plot for chain graph, missing data 



(d) rescaled plot 



Figure 5. (a) Plots of the error |||0 — 0|||op after running projected gradient descent on the 
non-convex objective for a chain-structured Gaussian graphical model with additive noise. As 
predicted by Theorems 1 and 2, all curves align when the error is rescaled by and plotted 
against the ratio ^ ^ , as shown in (b). Plots (c) and (d) show the results of simulations on 
missing data sets. Each point represents the average over 50 trials. 



5 Proofs 

In this section, we turn to the proofs of our two main theorems, as well as their various 
corollaries. Included in the main text are the primary steps in the proofs, with more technical 
aspects of the proofs deferred to the appendices. 
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5.1 Proof of Theorem 1 



Let = ^P'^T(3 — /3) + An||/3||i denote the loss function to be minimized. This definition 
captures both the estimator (4) with = and the estimator (7) with the choice of A„ given 
in the theorem statement. For either estimator, we are guaranteed that (3* is feasible and /3 
is optimal for the program, so /2(/3) < C{I3*). Indeed, in the regularized case, the A:-sparsity 
of (3* implies that ||/3*||i < \fk\\P*\\2 < boVk. Defining the error vector := (3 — (3* and 
performing some algebra leads to the equivalent inequality 

h^f^) < (p, 7 - rr> + A„{iiriii - iir + PHi}. (29) 

In the remainder of the proof, we first derive an upper bound for the right-hand side of this 
inequality. We then use this upper bound and the lower-RE condition to show that the error 
vector P must satisfy the inequality 

\mi<8Vk\m2. (30) 

Finally, we combine the inequality (30) with the lower-RE condition to derive a lower bound 
on the left-hand side of the basic inequality (29). Combined with our earlier upper bound on 
the right-hand side, some algebra yields the claim. 



Upper bound on right-hand side We first upper-bound the RHS of inequality (29). 
Holder's inequality gives (z/, 7 — r/3*) < ||z?||i ||7 — r/3* ||oo. By the triangle inequality, we have 



|7-r/3*||oo < ||7-S:./3*||oo + ||(S.-r)/3*||oo < 2v?(Q,c7,^ 



n 



where inequality (i) follows from the deviation conditions (17). Combining the pieces, we 
conclude that 



(P, 7 - r/3*) < 2\\uh v^(Q,ae)y = (||i?5||i + IMi) 2(^(Q,a,)^i^. (31) 
On the other hand, we have 

lir + ^lli - lirili > {\Wsh - W^sWi} + \\^s4i - lirili = \\ks4i - W^sWi, (32) 

where we have exploited the sparsity of f3* and applied the triangle inequality. Combining 
the pieces, we conclude that the right-hand side of inequality (29) is upper-bounded by 



log p 

o"e)\/ (l|z^s||i + lli^sHli) + A„{||z?5||i - (33) 

n 



a bound that holds for any nonnegative choice of A„. 



Proof of inequality (30) In the case of the constrained estimator (4) with R = ||/3*||i, 
we have ||/3||i = ||/3* + z?||i < ||/3*||i. Combined with inequality (32), we conclude that 
< ll^^slli- Consequently, we have the inequality ||P||i < 2||z?s'||i < 2\/fc||^'||2) which is a 
slightly stronger form of the bound (30). 
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For the regularized estimator (7), we first note that our choice of An guarantees that the 
term (33) is at most — ^||i'S'':||i- Returning to the basic inequahty, we apply the 

lower- RE condition to lower-bound the left-hand side, thereby obtaining the inequality 

''"11^112 ^ 1/' ||-~||2 ||^||2\ ^ 3An II -^".11^ II 

--Iklli < 2(«ill^ll2 -'^ll^lli) ^ ^ll'^slli - ylli^sHli- 
By the triangle inequality, we have < ||/3||i + ||/3*||i < 26o\/A;. Since we have assumed 



'%T{n,p) < '^^'bp'^^"* yJ^^TTi we are guaranteed that 



^ ^ < (^(Q,ae)A/— < ^||^>||i, 



by our choice of A^. Combining the pieces, we conclude that 

^^fi II II \ \ II /\\ II 11^ II \ w II w m 

< —\\Ks\\i - + ^(Ir^lll + IksHIl) = ^Ikslll - ^Ik5<=||l, 

and rearranging implies < THz^sHi, from which we conclude that < 8\/A;||z?||2, as 

claimed. 



Lower bound on left-hand side We now derive a lower bound on the left-hand side of 
inequality (29). Combining inequality (30) with the RE condition (12) gives 

V^fD>ai\\ug-T{n,p)\\m > {ai-6AkT{n,p)}\\m > ^ Ml (34) 

where the final step uses our assumption that kT{n,p) < 
Finally, combining bounds (33), (30), and (34) gives 



"1 l|-||2 / o 

— 9 < imax 
4 " ""^ ~ 



{2^{Q,a,)^J^, An} ll^lli 
< 32 \/% max {cpiQ, a^)^^^^^ , An} \\v\\2i 



yielding inequality (19a). Using inequality (30) again gives inequality (19b). 



5.2 Proof of Theorem 2 

We begin by proving the claims for the constrained problem, and projected gradient descent. 
For the ^2-error bound, we make use of Theorem 1 in the pre-print of Agarwal et al. [']. 
Their theory, as originally stated, requires that the loss function be convex, but a careful 
examination of their proof shows that their arguments hinge on restricted strong convexity 
and smoothness assumptions, corresponding to a more general version of the lower- and upper- 
RE conditions given here. Apart from these conditions, the proof exploits the fact that the 
sub-problems defining the gradient updates (14) and (15) are convex. Since the loss function 
itself appears only in a linear term, their theory still applies. 

In order to apply Theorem 1 in their paper, we first need to compute the tolerance param- 
eter defined there; since /3* is supported on the set 5 with |5| = A; and the RE conditions 



21 



hold with r x fi^d that 

n ' 

^2 < rii2 + 2||^- rill)' 

^ / fclogp -J 2 , logPllfl «*||2 

II/? -/3 II2 + C1 11^-/3 111 

02?! 02?! 

<C20-f3*\\l + ci^—\\d-P*\\l 

where the final inequality makes use of the assumption that n ^ /clog p. Similarly, we may 
compute the contraction coefficient to be 

7=fl-^ + ^i^l fi_^^^y\ (35) 

so 7 G (0, 1) for n ^ /clogp. 

We now establish the £i-error bound. First, let A* := /?* - ^ • 

Since /3* is feasible and /3 

is optimal with an active constraint, we have ||/3*||i < ||/3||i. Applying the triangle inequality 
gives 



II < iiriii + 11/3- rill = iinii + 11/3 -rill, 

||/3*||i = lir + A*||i > lir + A*5.||i - ||A*5||i = II/3JII1 + ||A*5.||i - ||A*5||i; 
combining the bounds yields ||A^c||i < ||A^||i + ||/3 — /3*||i. Then 

||A*||i < 2||A*5||i + 11^- rill < 2^/^||A*||2 + 11^- rill, 

so 

11/3* - ^iii < 11^- rill + iiA*iii < 2V^(ii/3* - ^||2 + 11^- rib) + 2||^- r 111. 

Turning to the Lagrangian version, we exploit Theorem 2 in Agarwal et al., with M. 
corresponding to the subspace of all vectors with support contained within the support set 
of /3*. With this choice, we have Vl-^) = Vfc, and the contraction coefficient 7 takes the 
previous form (35), so that the assumption n ^ /clogp guarantees that 7 S (0, 1). It remains 
to verify that the requirements are satisfied. From the conditions in our Theorem 2 and 
using the notation of Agarwal et al., we have /3(A^) = C'(^^^) and p = Vk, and the condition 
n'^ k logp implies that = 0(1). Putting together the pieces, we find that the compound 

tolerance parameter satisfies the bound = ©(^^ ||/3 - r 111) = 0(11,^ - r 111), so the 
claim follows. 

5.3 Proof of Corollary 1 

The proof of this corollary is based on two technical lemmas, one establishing that the lower- 
and upper-RE conditions hold with high probability, and the other proving a form of the 
deviation bounds (17). 

Lemma 1 (RE conditions, i.i.d. with additive noise). Under the conditions of Corollary 1, 
there are universal positive constants Ci such that the matrix satisfies the lower- and 
upper-RE conditions with parameters ai = ^shi^^^eI ^ = |Aniax(5^a;); CL^d 

T{n,p) =€0 Xmini^x)raax{ \l , 1) 



with probability at least 1 — ci exp ( — C2nmin (^t^f^^, l)) . 
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Proof. Using Lemma 13 in Appendix A, together with the substitutions 

r-Sa; = E2, and s:=-- mm<^i^ ,1^, (36) 

n clogp J 

where = + and c is chosen sufficiently smah so s > 1 , we see that it suffices to show 
that 

sup \V ( < 



veK{2s) 



n ' ^ 54 



D{s) 

with high probabihty. 

Note that the matrix Z is sub-Gaussian with parameters (S^; + S^, cj^). Consequently, by 
Lemma 15 in Appendix B, we have 

^2 ^ 

IP[-D(s) >t\ <2exp(-c'nmin(— r,^) +2slogp), 

for some universal constant c' > 0. Setting t = ""^^ , we see that as long as the constant 
c in the definition (36) is chosen sufficiently small, we are guaranteed that 

^Dis) > ^^^] < 2exp ( - c^nmm (^f^^, l)), (37) 
which establishes the result. 

□ 



Lemma 2 (Deviation conditions, additive noise). Under the conditions of Corollary 1, there 
are universal positive constants Ci such the deviation hound (16) holds with parameter 

99(Q,cre) = coaz{aw + cxe) ||/3* II2, 
with probability at least 1 — ci exp(— C2 logp). 
Proof. Using the fact that y = X (3* + e, we may write 

ll7-r/3*||oo = 
< 

Hence, the conclusion follows easily from Lemma 14 in Appendix B. 

□ 



.Z^y .Z^Z 



n 



n 



,Z^{X(3* + e) ,Z^Z 



n 



n 



+ 



n 

Z^W 

n 



-S^)/3* 
)/5lL- 
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Extension to unknown In the case when Tjyj is unknown, we first verify the deviation 
bound (16). Note that the form of 7 is the same as in the case when is known, so it 
suffices to bound the quantity ||(r — Sj,.)/3*||oo w.h.p. Furthermore, 

||(f - S,)/3*|U < ||(f - f )r Hoc + ||(f - S,.)/31|oo 

= ||(S^-S^)/3*|U + ||(f -S,)/3*||oo, 



and the second term is bounded by ca^\/ w.h.p., by Lemma 14 in Appendix B. If we use 



z V n 

the estimator = ^WqWq, then 



P(||(S^ - S^)/3*||oo <ccT2yi^) > l-ciexp(-C2logp) 
by the same sub-Gaussian tail bounds. Since o"^ < a^, we conclude that 

||(f -S,.)/3*||oo <ca^ 



2 /logp 



n 



with probability at least 1 — ci exp(— C2 logp), as wanted. 
Turning to the RE conditions, we similarly write 

\v^(r - j:^)v\ < |v^(r - f)v\ + \v^{f - ^.^)v\ 

Then applying Lemma 15 to both terms, followed by Lemma 13, yields the required bounds. 



5.4 Proof of Corollary 2 

We now turn to the proof of Corollary 2, which applies to the case of missing data, based on 
the general M-estimator using the pair (rn,is,7^is) defined in equation (11). We will establish 
that the RE conditions and deviation conditions (17) hold with high probability. 

Lemma 3 (RE conditions, i.i.d. with missing data). Under the conditions of Corollary 2, there 
are universal positive constants Ci such that r„„ satisfies the lower- and upper-RE conditions 
with parameters ai = ^jsh^^^)^ = |Amax(Sx), and 

T{n,p) = coAmin(Sx)max (- ^t^-^^tt, l) 

with probability at least 1 — ci exp ( — C2nmin ((1 — Pmax)^ • -^^^^Jr^i l)) • 

Proof. This proof parallels the proof of Lemma 1 for the additive noise case. We make use of 
Lemma 13. This time, we have 

- Z'^Z Z'^Z 

r - S^. = © M - s^. = ( S,) © M, 

n n ' 
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with the parameter s defined as in equation (36), with = -p. — — — v^. Note that for a vector 

V-L Pmax J 

V gMP, we have 

\v'^(r-^^)v\ = \v^{{^-^,)®M)v\ 

1 i T' / Z , \ 1 

~ \\M\\ ■ ' ^ n ' ^ 

11-"^ 1 1 mm 



Furthermore, Z is a sub-Gaussian matrix with parameters (S^, a^), so applying Lemma 15 in 

54 



Appendix B with t = (1 — Pmax)^ ''^mmCSr^) ^j^^ right-hand expression, we obtain the bound 



iDis) > ^^^^^] < 2exp ( - C2n min ((1 - p^^^^ ' l)). 



□ 



Lemma 4 (Deviation conditions, missing data). Under the conditions of Corollary 2, there 
are universal positive constants Ci such the deviation hounds (17) hold with parameter 

(^(Q, cjj = Co (o-e + ) , 

J- Pmax ^ Pmax 

with probability at least 1 — ci exp ( — C2 logp) . 

Proof. The key idea is to note that the observed matrix Z is a sub-Gaussian matrix with 
parameter cr^. Indeed, recaUing that the hidden matrix X is sub-Gaussian with parameter 
cr^, we see that for any unit vector v E M^, and any missing value pattern of Xj, we have 

E [ exp(AZf ) I missing values] = E(exp(AXiii)) < exp (^^), (39) 

where the vector u has entries Uj = Vi when entry i is observed, and = otherwise. 
By the tower property of conditional expectation, it follows that the moment generating 
function of Zv is upper-bounded by the same quantity, so Z is also a sub-Gaussian matrix 
with parameter at most o"^. 
Observe that 

||7-S,ri|oo = ||(^(Z^y-cov(Zi,y))©(l-p))/3*|| 

< ||-(Z^y-cov(z„y))/3*||^ 

J- Pmax ^ 

< r^(ll-(^^^-^°^("-^^))^1L + ll— ID- (40) 

1 - yOmax , n ' "n 



Using the sub-Gaussianity of the matrices X, Z, and e, and Lemma 14, the two terms may 
be bounded as 



^ a1 /logp 
Ti >co- 

(1 - Pmax)'' V n 



72 > Cot- rW 

(1 - Pmax) V n 



< ci exp(-C2logp), (41a) 

< ci exp(— C2 logp). (41b) 
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Now consider the quantity || (F — Sa;)/3* ||oo- By a similar manipulation, we have 



ll(r-s,)/3* 



l((- 



< 



n 
1 



S.)©M)/3 



n 



S.)/3*| 



so using Lemma 14 yields 



(r-s.)r|L<co- 



( 1 Pmax ) ^ 



logp 



n 



(42) 
(43) 

(44) 



with probability at least 1 — ci exp(— C2 logp). Combining bounds (41) and (44), we conclude 
that the deviation conditions (17) both hold with parameter 

0"J = Co — (— h CTe), 

-L Pmax J- Pmax 



with probability at least 1 — ci exp(— C2 logp), as claimed. 



□ 



Extension to unknown pj We now consider the more challenging case when the missing 
probabilities pj are unknown. Note that the estimates satisfy the deviation bound 



°(max|pj — Pj\ > t) < ci exp(— C2nt^ + logp). 



(45) 



by a Hoeffding bound for Bernoulli random variables, together with a union bound. In 



particular, taking t = cq 



we have 



\P- P\\ 



< 



Co 



logp 



n 



(46) 



with probability at least 1 — ci exp(— C2 logp). 

As long as n /, , as required by our results, the deviation condition (46) implies 

\^ Pmax; 

that \pj — Pj\ < ^"2'"'"'' for each j, so 



(1 - Pj) > (1 - Pj) 
In particular, we obtain the bound 



max 
j 



1-pj 



1 



< max 



\Pj-pj\ 



< 



and since 



i-Pj 



1-pj 

— l| < 1 by inequality (47), we also have 
1- Pi){l- Pj) 



max 



r (1-Pmax)(l-P,) " (1-Pmax)2 ^"r'^^' 



max 



(l-p,)(l-p,) 



(47) 



(48) 



(49) 



l)(^-l) + (^-l) + ^'-^^- 



< 3 max 

j 



I- Pj 



1-pj 



< 



6 



max 



(1 - Pmax)^ i 



Pj-Pj\ 



■I-Pj 



(50) 
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using the triangle inequality and inequality (48). 

We will use these bounds to verify the results of Lemmas 3 and 4 for the estimators (24). 
For the deviation bounds, we begin by writing 

||(r - s,)riloo < ll(r - f )riloo + ll(r - s,)/3*||oo. 

Note that we have already bounded ||(r — 5]a.)/3*||oo in inequality (44). Furthermore, 

||(r-r)riloo = IK ©M ©M)nu 

= \\{^—^®M) (M©M- 11^) /?* 
<||M©M-11^|| IK ©M)B*\\ 

— II ^ II max 1 1 V / ^ 1 1 CO 

2 

< jT— 72 max|/)j - pj\ (||r - S^)/3*||oo + ||oo) , 

\ J- Pmax ) 1 

where we have used inequality (48) and the triangle inequality in the last inequality above. 
Noting that ||Sx/3*||oo < Amax(5^2:)||/3*||2 < ccr^ and using the bounds (44) and (46), we obtain 



ll(r-5].riU<y^-^\/— 

Combining the pieces, we conclude that the deviation conditions (17) are satisfied with 

(/?(Q, o-J = Co 1 (jz^ h c^^, as claimed. 

For the RE conditions, we use a similar argument. Note that by Lemma 13, we need to 
show that |?;'^(f - S^>| < for all v £ K(2s), with high probability. We write 

|t;^(f - S^)t;| < \v'^(r - f)v\ + \v^{f - S^)u|, 

and note that we have already shown how to upper-bound |v"^(r — r)f | by cAniin(Sa;) with 
high probability. Furthermore, 

Iv'^Cr - T)v\ = Iv'^i © M © M)v\ 

I ' ' ^ n n ' ^ 



= \v^{ ©M) {M©M -\\^\v\ 



n 

I max I 



< \\M®M -lV\\\v' [ ®M)v\ 

II II IlldX I ^ ' I 

< ||M© M - H^\\^^^{\v^{f - S,)t>| + 

Making use of inequality (49) and the concentration bound (45) with t = c (^"^ ^ ( 1 — Pmax ) ^ ; 
we obtain 

||M©M- ll^lUaxb^S^^^I <c\r^in{^x) 

with probability at least 

1 - ci exp [ - C2n(l - p^^)4^k(|^] > 1 _ exp [ - c^l - Pma.)^ ^^^°fi^ ] . 

'^maxl^ajj '^x 

Note that t < c', so the earlier upper bound on \v'^ [T — Tix)v\ is sufficient to ensure that 
\v'^ {r — Tix)v\ < ''^■"■g^^^) with the required probability. 
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5.5 Proof of Corollary 3 

We now need to establish the RE conditions and deviation bounds (17) for the Gaussian VAR 
case, which we summarize in the following: 

Lemma 5 (RE conditions, dependent case with missing data). Under the conditions of Corol- 
lary 3, there are universal positive constants Ci such that T^i, satisfies the lower- and upper-RE 
conditions with ai = ^"""^ , 02 = ^^maxC^x) , o-nd 

T{n,p) = coAmin(Sx.)niax(-2 — t^ttI) ! 

with probability at least 1 — ci exp ( — C2nmin ( ''^'"'^4^'^'* , l)) . 

Proof. The proof is identical to the proof of Lemma 1, except we use Lemma 18 instead of 
Lemma 15 in Appendix B. 

□ 

Lemma 6 (Deviation conditions, VAR with additive noise). Under the conditions of Corol- 
lary 3, there are universal positive constants Ci such the deviation bounds (17) hold with 
parameter 



with probability at least 1 — ci exp ( — C2 logp) . 
Proof. We begin by bounding the term 

||(f-S,)riU= max |eJ(-Z^Z-S,)r|. 

Define the function <I>(u,f) := [^Z^^ Z — Ti^^v and rewrite the term as maxi<j<p |$(ej,/3*)|. 
For each fixed j, some simple algebra shows that 

$(e„r) = \{Heo + r,e,- + /?*) - a>(e„e,) - f (/3*,/3*)}, (51) 

so it suffices to have a high-probability upper bound on the quantity (^{v.,v) for each fixed 
unit vector v. In particular, combining inequality (76) from Lemma 17 (see Appendix B) with 
the union bound and the relation (51), we conclude that 



P[||(f-S,)ri|oo >coCy^] <ciexp(-C2logp). (52) 

We now turn to the quantity ||7 — Sa;/3*||oo, which by the triangle inequality may be 
upper-bounded as 

||7 - < II [^Z^Z - S,)r IL + II {^W^W - S^)/3*||^ 

+ \\lz^4oo + t-^'^Wnoo- (53) 

We have already bounded the first term in inequality (52) above. As for the second term, 
the matrix W is sub-Gaussian, so that Lemma 14 can be used to control it (as in previous 
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arguments). In order to upper-bound the third term on the RHS of equation (53), we first 
condition on Z. Under this conditioning, the third term may be written as max£=i 

where vn := ^{Zeg, e) is a zero-mean Gaussian variable with variance at most if ( ^^^/^^^ 
Combining the union bound and the deviation bound (76) with t = 1, we conclude that as 
long as n ^ logp, then 

1 1 ^ ^ 1 1 

F\ max ( — — )^ > cC^l < ci exp ( — C2 logp) . 

^e=i,...,p Jn 



Conditioning on this event and applying standard tail bounds to control {f^}, we conclude 
that > t] < exp ( — ^§^nt'^Y Setting t = CQa^C,\J^^^^ and then taking a union bound 

over £ G {1, . . . ,p} yields the desired result. A similar analysis can be used to bound the 
fourth term, since the matrices X and W are independent. Combining the pieces yields the 
claim. □ 

5.6 Proof of Corollary 4 

Lemma 7 (RE conditions, dependent case with missing data). Under the conditions of Corol- 
lary 4, there are universal positive constants Ci such thatT^^^ satisfies the lower- and upper-RE 
conditions with ai = r^siii^^il^ = |Amax(5^x); and 

T{n,p) = coAmin(Sx)max(-^-^— — ,l)i^^, 
with probability at least 1 — ci exp ( — C2nmin ( ''^'"'^,4^''"* , l)) . 

Proof. Again, we simply substitute the bound of Lemma 18 for the bound of Lemma 15 in 
the proof of Lemma 3. 

□ 

The final step is verify the deviation bounds (17). 

Lemma 8 (Deviation conditions, VAR with missing data). Under the conditions of Corol- 
lary 4, there are universal positive constants Ci such the deviation bounds (17) hold with 
parameter 



P,a,) = coC' (C' + <^.), 
with probability at least 1 — ci exp(— C2 \ogp). 

Proof. To control the term ||(r — Sa;)/3*||oo, we use the same argument as in Lemma 6 to 
obtain inequality (52). For the term ||7 — S2,/3*||oo, we use the expansion (40) from the i.i.d. 
case. We show how to bound the terms Ti and T2 appearing in the expansion. 

II I|2 II ||2 

For a vector v G W, write ^'(f) = -^^^ — E(^i^), and note that 



Ti = max ^ \^{Zej + X/3*) - ^{ZeA - ^{Xp*)] . (54) 
i 2 '- 



By Lemma 17, we may upper-bound the last term in equation (54) by CC'^(1 — /Omax)^ y 
with probability at least 1 — ci exp(— C2 logp). In order to bound the other two terms. 
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we again use Lemma 17. Note that Zej is a mixture of Gaussians N{0,Qj), each with 
ilQjliop < (1 — Pmax)^C'^- Then Zej + is also a mixture of Gaussians N{0,Q'j), and we 
have the bound |||Q-|||op < 4C'2(1 ~ Pmax) • Hence, by Lemma 17 and a union bound, we con- 
clude that Ti < cC'^(l — /Omax)^ y'''^^' '^it'^ probability at least 1 — ci exp(— C2 logp). Turning 
to term T2 in the expansion (40), we condition on Z. Repeating the argument in Lemma 6, 
we obtain the bound 

T2 < CoCJeC'(l - /5max) 

Finally, plugging back into inequality (40), we arrive at the bound 

II7 - < C(C''(1 - Pmax) + CTeO 

Altogether, we have the form of f given by (p{Q,ae) = co((TeC' + C'^); s-s claimed. □ 





5.7 Proof of Corollary 5 

First note that by Theorem 1, we have the bounds 



\e^-e^\U<^-r^:^k/-^, (55) 

ai \ n 



\\e^-e^2<^-r^^J'^. (56) 

ai \ n 

We now establish the following lemma, which we will use to prove the theorem. 
Lemma 9. For each 1 < j < p, we have 

^ <\aj\<—^—r and ||6I^'||2 < «;(S). (57) 



-^max(S) '^min(5^) 



Proof. Observe that \aj\ < maxj|©jj| < Amax(0) = ^ -^(s) ' ^^"^ similarly, \aj\ > j — 
which establishes the first inequality (57). Next, note that the rows (and also columns) of Q 
are bounded in ^2-iiorm according to the inequality 

1 

||Gj.||2 = ||Gej||2 < Amax(0) 



Amin(^) 

which implies that ||^-'||2 = ||0-i/%'||2 = Ib/jojl < '^(S), as claimed. □ 

Moving forward, we establish the following deviation inequalities between aj and Q.j and their 
respective estimators. 

Lemma 10. For all j, we have the following deviation inequalities: 



Amin(5]) Amin(5]) Ol 







n 


/log 


p 
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Proof. We first derive inequality (58). Since the columns of G are fe-sparse, H^-'Hi 
Then 

< Wh + 11^^- - ^li < cVkiWh + ^^/^) 

cti \ n ' 

where we have used Lemma 9 and inequality (55). Under the assumed scaling n ^ klogp, 
this simplifies to the inequality 

\\0^\\i < CK{^)Vk. 



We now have 



'l = |(S,,-S,,_,^)-(S,,-S,,_,0^)| 



(60) 



(61) 



Using inequality (28), we have Ti < c(/?(Q,cre 



Furthermore, 



< ||S-S|Uax||^^'||l + ||S,,_,-||2||^-''-e^'||2 



,ae)x klogp 



<c(v9(Q,a,)K(S) + An,ax(S:,.) 

Qi V n 

using inequality (60) and inequality (56). Substituting back into inequality (61), we obtain 



|a/-a/| <c 



crjK(S) + Aniax(Sa;) 



(Te). klogp 



n 



for all J. Hence, 

Oi 

a,; 



1 



r'-ar'i<c^(s)( 



+ 



Amin(S) ai 



,f7e)x klogp 



n 



using Lemma 9, so \aj\ < 2a j for n ^ fclogp, and 

ck(S) 



a, — a, 



< 



'^min(S) '^min(S) 



+ 



(Te)s, klogp 



n 



(62) 



which establishes inequality (58). 

Turning to inequality (59), we have 

||6.j— 0.J 111 = \aj — aj\ + 11%^-' — fflj^-'lli 

<|%-a,| + |a,|||^-'--e^||i + |a,-a,-|||^^||i 

~ ^Amin(S) ai Amin(S) ^ Amin(S) 



'))k 



logp 



n 



by a combination of inequalities (55), (58), (60), and (62). Noting that k(S) > 1, we arrive 
at inequality (59). 

□ 



31 



Returning to the proof of Corollary 5, observe that since Q and Q are symmetric, we have 
III© ~ ©flop ^ III© — ©ill- Furthermore, by the triangle inequality and the definition of 0, 

III© - ©ill < III© - ©ill + III© - ©ill < 2|ie- e|ii = 2max||e.,- -e.Ji, 

j 

so that the union bound and inequality (59) yield the claim. 

6 Discussion 

In this paper, we formulated an £i-constrained minimization problem for sparse linear regres- 
sion on corrupted data. The source of corruption may be additive noise or missing data, and 
although the resulting objective is not generally convex, we showed that projected gradient 
descent is guaranteed to converge to a point within statistical precision of the optimum. In 
addition, we established ii- and ^2-error bounds that hold with high probability when the 
data are drawn i.i.d. from a sub-Gaussian distribution, or drawn from a Gaussian vector au- 
toregressive process. Finally, we used our results on linear regression to perform sparse inverse 
covariance estimation for a Gaussian graphical model, where the data are observed subject to 
corruption. The bounds we obtain for the spectral norm of the error are of the same order as 
existing bounds for inverse covariance matrix estimation when the data are uncorrupted and 
i.i.d. 

Future directions of research include studying more general types of dependencies or cor- 
ruption in the covariates of regression, such as more general types of multiplicative noise; 
and performing sparse linear regression for corrupted data with additive noise when the noise 
covariance is unknown and replicates of the data may be unavailable. As pointed out by a 
reviewer, it would also be interesting to study the performance of our algorithms on data 
that are not sub-Gaussian, or even under model mismatch. In addition, one might consider 
other loss functions, where it is more difficult to correct the objective for corrupted covariates. 
Finally, it remains to be seen whether or not our techniques — used here to show that certain 
non-convex problems can solved to statistical precision — can be applied more broadly. 
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A Restricted eigenvalue conditions 

In this appendix, we provide the proofs for various lemmas used to establish restricted eigen- 
value conditions for different classes of random matrices, depending on the observation model. 
We begin by establishing two auxiliary lemmas, and then proceed to the main lemma used 
directly in the proofs of the corollaries. Our first result shows how to bound the intersection 
of the £i-ball with the ^2-ball in terms of a simpler set. 

Lemma 11. For any constant s > 1, we have 

Bi(Vs) n B2(l) C 3 cl{conv {Bo(s) n B2(l)}}, (63) 
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where the balls are taken in p- dimensional space, and cl{-} and conv{-} denote the topological 
closure and convex hull, respectively. 



Proof. Note that when s > p, the containment is trivial, since the right-hand set equals 182(3), 
and the left-hand set is contained in 182(1). Hence, we will assume 1 < s < p. 

Let A, B '^W be closed convex sets, with support function given by (j)Aiz) = supgg^(^, z) 
and (pB similarly defined. It is a well-known fact that (pAiz) < 4>Biz) if and only if A C i? 
(cf. Theorem 2.3.1 of Hug and Weil [' ]). We now check this condition for the pair of sets 
A = Bi(^)nB2(l) and 5 = 3 cl { conv {Bo(s) n B2(l)}}. 

For any z G W, let S C {1, 2, . . . ,p} be the subset that indexes the top [sj elements of z 
in absolute value. Then < \zj\ for all j £ S, whence 

„ „ 1 „ „ 1 „ „ 

\\zs4oc, < ■T—\\\^s\\i < -j^\\zsh- (64) 



We now split the supremum over A into two parts, corresponding to the elements indexed by 
S and its complement S^, thereby obtaining 

(j)A{z) = sup{9, z) < sup {9s, zs) + sup (6*5=, zs^) 
e&A \\8sh<i \\es'^\\i<y^ 

< Wzsh + Vs\\zsA\oo 

<(i + ^L^)lk5lb 

< 31125112, 

where step (i) makes use of inequality (64). Finally, we recognize that 

(j)B{z) = sup(0, z) = 3 max sup {9u, zu) = 3||z5||2, 
eaB \u\=Vs\ ||6it,i|2<i 

from which the claim follows. □ 

For ease of notation, define the sparse set ]K(s) := Bo(s) nB2(l) and the cone set C(s) := 
{v : \\v\\i < \/~s\\v\\2}. Our next result builds on Lemma 11, showing how to control deviations 
uniformly over vectors in MP. 

Lemma 12. For a fixed matrix T G M^^^, parameter s > 1, and tolerance 6 > 0, suppose we 
have the deviation condition 

\v'^Tv\<6 yveK{2s). (65) 

Then 

\v'^Tv\< 27 6{\\v\\l + ^\\v\\l) yveW. (66) 
Proof. We begin by establishing the inequalities 

\v'^rv\ < 27 6 \\vg Vw e C(s), (67a) 

Iv'^Tvl < \\v\\l 4 C(s). (67b) 

s 

Inequality (66) then follows immediately. 
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By rescaling, inequality (67a) follows if we can show that 

|f'^ri;| < 27(5 for all v such that ||u||2 = 1 and ||^;||i < y/s. 

By Lemma 11 and continuity, we further reduce the problem to proving the bound (68) for all 
vectors v G 3conv {]K(s)} = conv {Bo(s)nB2(3)}. Consider a weighted linear combination of 
the form v = Q-iVi, with weights a, > such that ai = 1, and \\vi\\o < s and \\vi\\2 < 3 
for each i. Expanding, we can write 



Applying inequality (65) to the vectors ^Vi, ^vj, and ^{vi + Vj), we have 

\vfTvj\ = l\{vt + VjfT{vi + Vj) - vfVvi - vjTvjl < ^(365 + 95 + 96) = 275 

for all z, j, and hence lii^rul < Ylij oiiaj{215) = 275\\a\\\ = 275, establishing inequality (67a). 
Turning to inequality (67b), first note that for v ^ C(s), we have 

\v'^Tv\ 1 I I 27<5 

< - sup \u^Tu\ < , (69) 



ll^lll ^ ll«lli<v^' 

II«I|2<1 

where the first inequality follows by the substitution u = and the second follows 

by the same argument used to establish inequality (67a), since u is in the set appearing in 
Lemma 11. Rearranging inequality (69) yields inequality (67b). □ 

Lemma 13 (RE conditions). Suppose s > 1 and T is an estimator of T,x satisfying the 
deviation condition 

^^(f - S,)H < ^H^5£l yvGK{2s). 
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Then we have the lower-RE condition 



V LV > m 9 

2 " "2 2s 



(70) 



and the upper-RE condition 



Tt^ ^ ^\ fs^ Ml ||2 I '^min(5j3::) II ||2 /^-,\ 

V Tv < -Amax(Sj.)||u||2 H \\v\\t. (71) 

2 Zs 

Proof. This result follows easily from Lemma 12. Setting F = F — S^; and 5 = ^siai^El^ -^g 



have the bound 



Then 



I T/^ ^ \ \ ^ Anim(Sx) /|| ||2 , In ||2\ 

\v (r - j:^)v\ < — - — (||i;||2 + -\\v\h). 



v^fv>v^j:.v-^^^^^!^(\\v\\l + -\\v\\l) and 
2 s 

V Tv<v S^UH (IPII2 + 

^ s 

so the inequalities follow from Amin(Sx.)||w||2 < v'^'ExV < Xmax{^x)\\v\\2- 



□ 
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B Deviation bounds 



In this appendix, we state and prove some deviation bounds for various types of random 
matrices. 

B.l Bounds in the i.i.d. setting 

Given a zero-mean random variable y, we refer to the quantity := sup£>;^ £~-'^(E|y|^)^/^ 

as its sub-exponential parameter. The finiteness of this quantity guarantees existence of all 
moments, and hence large-deviation bounds of the Bernstein type. 

By Lemma 14 of Vershynin \l i], if X is a zero-mean sub-Gaussian random variable with 
parameter cr, then the random variable Y = X^ — E(X^) is sub-exponential with Hl'H^i < 2(T^. 
It then follows that if Xi, . . . ,X„ are zero- mean i.i.d. sub-Gaussian variables, we have the 
deviation inequality 

|- J^x2-E[X,f]| >t <2exp(-cmin(^,^)) for ah t > 0, 

where c > is a universal constant (see Proposition 16 in Vershynin [ ]). This deviation 
bound may be used to establish the following useful result: 

Lemma 14. If X £ R"^pi is a zero-mean sub-Gaussian matrix with parameters 
then for any fixed (unit) vector v € , we have 

¥[\\\Xv\\l -E[\\Xv\\l] \ > nt] < 2exp(-cnmin(^,^)). (72) 

Moreover, ifY^ ]^"-xp2 ^ zero-mean sub-Gaussian matrix with parameters (Sy,cjy), then 

X t^ t 

cov(2/i,Xi)|L^, >t) < 6pip2exp(-cnmin(— — -2,— )), (73) 

IL yux<->yj '-'x'-'y 

where Xi and Yi are the i^^ rows of X and Y , respectively. In particular, if n'^ ^ogp, then 



Y^X /logp 

cov(yi,Xi)||^^^ > cqo-^ctj^ W ) < ciexp(-C2logp). (74) 



Proof. Inequality (72) follows from the above discussion and the fact that Xv is a vector of 
i.i.d. sub-Gaussians with parameter a. In order to prove inequality (73), we first note that if 
Z is a zero-mean sub-Gaussian variable with parameter cr^, then the rescaled variable Z/a^ 
is sub-Gaussian with parameter 1. Consequently, we may assume that ax = o'y = 1 without 
loss of generality, rescaling as necessary. We then observe that 

f Y'^X 1 1 

efl — cov(yi,Xi) \ej = -[HXej + Ya) - ^{Xe^) - $(^6^)], 

II II 2 II II 2 

where we have defined $(v) := i^^^— E(ii^) . Since Xe^^Yei is sub-Gaussian with parameter 
at most 4, we may apply inequality (72) to each of the three terms, to obtain 

I T / Y'^ X \\ \ 3f 

Vi [— cov{yi,Xi))ej\ < — 
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with probability at least 1 — 6 exp ( — cn min t}) . Taking a union bound over all 1 < i < pi 

and 1 < i < P2 yields inequality (73). Finally, setting t = CQaxfTy \ and using the 
assumption n ^ logp yields inequality (74). 

□ 

We combine this lemma with a discretization argument and union bound to obtain the 
next result. For a parameter s > 1, recall the notation ]K(s) := S | 11^112 < 1, H'^llo ^ •s}- 

Lemma 15. If X G M"^p is a zero-mean sub-Gaussian matrix with parameters (S,(7^), then 
there is a universal constant c > such that 



sup — E \ \ >t 

i,eK(2s) n ^ n " 



t" t 



< 2 exp ( — cn min (^, —2) + 2s logp) . (75) 



a 



Proof. Given U C {1, . . . ,p}, define Su = {v £ W : \\v\\2 < l,supp(f) C U}, and note that 
]K(2s) = U|i/|<2s ^u- li A = {ui, . . . , Um} is a 1/3-cover of Su, then for every v S Su, there is 
some Ui £ A such that || Au||2 < ^, where Av = v — Ui. It is known [10] that we can construct 
A with 1^1 < If we define #(vi,U2) = t^f (^^ -^)v2, we have 

sup |^>(f,f)| < max |$(uj, tij)| + 2 sup | max <I>(Af , nj)| + sup |<I>(Ai), Af )|. 

v(^Su « v&Su * v<^Su 

Since 3 Aw G Su, it follows that 

2 1 

sup \^{v,v)\ < max|$(nj,tii)| + sup (-|^>(t;, t;)| + -\^{v,v)\), 

vesu * vesu -3 y 

hence sup^,g5^ |<I>(f,u)| < | maxj \^{ui,Ui)\. By Lemma 14 and a union bound, we obtain 

P( sup -E(^i^^)| >t) < g^'' • 2 exp (- cn min (^, ^)). 

v&Su n n a a 

Finally, taking a union bound over the < p^" choices of U yields 

WXvP WXvP ^ 

P( sup \- ^-E(^J ^)| > t) < 2exp(-cnmin(^,^) +2slogp), 

veK{2s) n n a a 

as claimed. □ 
B.2 Bounds for autoregressive processes 

We base our analysis of Gaussian autoregressive matrices on the following lemma: 

Lemma 16. Suppose Y £ M™" is a mixture of multivariate Gaussians Yj ~ N{0, Qj), and let 
C7^ = supj III Qj III op- Then for all t > we have 



^l"y||i-E(||y||i)| >4ta2 



n 



m(t-^? 
< 2 exp ( ^'"^ ) + 2 exp(-m/2) . 
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Proof. This result is a generalization of Lemma 1.2 in the paper [1")]. By definition, the random 
vector y is a mixture of random vectors of the form y^QjXj, where Xj ~ A^(0, For each 
index j, the function fj{x) = \\^yQjx\\2/ \^ is Lipschitz with constant |||\/Qlllop/-v/^- Since 
each Xj is Gaussian, it follows from the concentration for Lipschitz functions of Gaussians [9] 
that fj{Xj) is a sub-Gaussian random variable with parameter = |||Qj|||op/"T- Therefore, 
the mixture ||y||2/-v/n is sub-Gaussian with parameter = ^ supj |||Qj|||op- The remainder 
of the proof proceeds as in the paper [15]. 

□ 

We now specialize the preceding lemma to the cases of additive noise and missing data 
appearing in our paper. 

Lemma 17. Let X € R"^p be a Gaussian random matrix, with rows Xi generated according 
to a vector autoregression (25) with driving matrix A. Let v he a fixed vector with unit 

norm. Then for all t > , 



nit-^Y 

<2exp( -^^) +2exp(-n/2), (76) 



where 



II III op + fiM^ii"'' (additive noise case), 



op 



(1-Pmax)2 1-|||A|||, 



(missing data case). 



Proof. First consider the additive noise case, where T — Ti^ = — S^. For any fixed vector 
with ||t;||2 = 1, the variable Zv £ M" is a zero-mean Gaussian random variable with covariance 
matrix, say Q ^ 0. In order to apply Lemma 16, we need to upper-bound the spectral norm 
of Q, which we do using the elementary upper bound IQlop < max Yll=i \Qii\- For each pair 

l<j<n 

i,£ £ {1,2,..., n}, we have 

IQj^l = \ cov{eJ Zv,eJ Zv)\ = \v'^ cov{Zi, Z£)v\, 

where Zi and Zi are the i^^ and rows of Z, and ||t;||2 = 1- For i £, we have 

\v^cov{Zi,Ze)v\ = \v^cov{Xi,Xe)v\ = A\'-'\^^v\ < |||S,|||„pP|||l;-^l, 

and for i = i, we have \v'^ cov{Z^ , Z^)v\ < fEzWop < |||5^«)|||op + |||Sx.|||op. Putting together the 
pieces, we conclude that IQIiop < C^i with ( as defined in the lemma statement. Consequently, 
the bound (76) follows from Lemma 16. 

In the missing data case, the variable Zv is a zero-mean mixture of Gaussians, conditioned 
on the positions of the missing data. Suppose Z' is the random matrix Z corresponding to a 
given positioning scheme (with O's in the missing positions). We claim that 

where Qj = Cov{Z'v). Indeed, we write IQjIop < maxj Yll=i IQjM^ each pair {i,i), 

IQj. .^1 = \coY{efZ'v,eJZ'v)\ = \cov{Z''v, Z'^v)\ = \cov{Z'vi, Z^V2)\, 
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where vi and V2 are the vector v with O's in the positions corresponding to the O's of Z'^ and 
Z'^, respectively. Since | cov(Z*t;i, Z^V2)\ < for i / £ and 

I COv(Z*Vi, Z'V2)\ < |||S^|||op + W^xWop 

by a similar argument as before, the claim (77) follows. By the bounding technique (38) 
earlier in the paper, together with Lemma 16, we arrive at inequality (76). 

□ 

Lemma 18. Let X be a Gaussian matrix with rows generated from a vector autoregression 
with driving matrix A. Let s > 1. Then for all t > 



sup |t;^(f - S^)u| > 4tC^ 

i;eK(2s) 



with C CLS defined in Lemma 17. 



2 

< 4exp ( - cn min {{t — )^, l) + 2slogp), (78) 



n 



Proof. We use the single-deviation bounds from Lemma 17, together with a discretization 
argument identical to that of Lemma 15. 

□ 
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